Abstract

We consider the problem of learning a sparse multi-task regression, with an application to a genetic
association mapping problem: discovering genetic markers that jointly influence the expression levels
of multiple genes. In particular, we consider the case where the structure over the outputs can
be represented as a tree with leaf nodes as outputs and internal nodes as clusters of the outputs at
multiple granularities, and aim to recover the common set of relevant inputs for each output cluster.
Assuming that the tree structure is available as prior knowledge, we formulate this problem as a
new multi-task regularized regression called tree-guided group lasso. Our structured regularization
is based on a group-lasso penalty, where the groups are defined with respect to the tree structure. We
describe a systematic weighting scheme for the groups in the penalty such that each output variable
is penalized in a balanced manner even if the groups overlap. We present an efficient optimization
method that can handle large-scale problems, as is typically the case in association mapping, which
involves thousands of genes as outputs and millions of genetic markers as inputs. Using simulated
and yeast datasets, we demonstrate that our method achieves superior performance in terms of both
prediction error and recovery of the true sparsity pattern, compared to other methods for multi-task
learning.

Keywords: lasso, group lasso, structured sparsity, multi-task learning, association analysis 

1 Introduction 

Many real-world problems in data mining and scientific discovery amount to finding a parsimonious
and consistent mapping function from high-dimensional input factors to a structured output
signal. For example, in a genetic problem known as expression quantitative trait loci (eQTL) mapping,
one attempts to discover an association function from a small set of causal variables known
as single nucleotide polymorphisms (SNPs), out of a few million candidates, to a set of genes whose
expression levels are interdependent in a complex manner. In computer vision, one tries to relate
high-dimensional image features to a structured labeling of objects in the image. An effective
approach to this kind of problem is to formulate it as a regression problem from inputs to outputs.
In the simplest case where the output is a univariate continuous or discrete response (e.g., a
gene expression measure for a single gene), techniques such as lasso [10] or $L_1$-regularized logistic
regression [6] have been developed to find a sparse and consistent regression function that
identifies a parsimonious subset of inputs that determine the outputs. However, when the output is
a multivariate vector with an internal sparsity structure, the estimation of the regression parameters
can benefit from taking this sparsity structure into account, so that
output variables that are strongly related can be mapped to the input factors in a
synergistic way, which is not possible using the standard lasso.

Figure 1: Tree regularization for multiple-output regression. (a) An example of a multiple-output
regression when the output variables form a tree structure. (b) Groups of variables associated with
each node of the tree in (a) in tree-guided group lasso.

In a univariate-output regression setting, sparse regression methods that extend lasso have been
proposed to allow the recovered relevant inputs to reflect the underlying structural information
among the inputs. For example, group lasso [12] assumed that the groupings of the inputs are
available as prior knowledge, and used groups of inputs instead of individual inputs as the unit of
variable selection by applying an $L_1$ norm of the lasso penalty over groups of inputs, while using
an $L_2$ norm for the input variables within each group. This $L_1/L_2$ norm for group lasso has been
extended to a more general setting with more complex structures on the sparsity
pattern than a simple grouping, where the key idea is to allow the groups to
overlap. The hierarchical selection method [13] assumed that the input variables form a
tree structure, and designed groups so that a child node enters the set of relevant inputs only if its
parent node does. Situations with arbitrary overlapping groups have been considered as well.

Many of these ideas related to group lasso in a univariate regression may be directly applied
to multi-task regression problems. The $L_1/L_2$ penalty of group lasso has been used to recover
inputs that are jointly relevant to all of the outputs, or tasks, where the $L_2$ norm is applied to the
outputs instead of groups of inputs as in group lasso [8]. Although the $L_1/L_2$ penalty has been
shown to be effective for joint covariate selection in multi-task learning, it assumes that all of the
tasks are equally related to each other and share the same relevant inputs. However, when there
is a complex pattern in the way that the tasks are related, only a subset of highly related tasks may
share the same sparsity pattern in their regression coefficients. In order to address this problem of
structured sparsity recovery in multi-task learning, extensions of group lasso with overlapping
groups [5] could be applied. However, the overlapping groups in these regularization methods
can cause an imbalance among different outputs, since the regression coefficients for an output
that appears in a large number of groups are more heavily penalized than those for outputs with
memberships in fewer groups. An ad hoc weighting scheme that weights each group differently in
the regularization function has been introduced to correct for this imbalance.

In this paper we consider a particular case of a sparse multi-task regression problem, where the
outputs can be grouped at multiple granularities. We assume that this multi-level grouping structure
is encoded as a tree over the outputs with an arbitrary height, where each leaf node represents an
individual output variable and each internal node indicates the cluster of the output variables that
correspond to the leaf nodes of the subtree rooted at the given internal node. Each internal node
in the tree is associated with a weight that represents the height of the subtree, or how tightly the
outputs in the cluster for that internal node are correlated. As illustrated in Figure 1(a), the outputs
in each cluster are likely to be influenced by a common set of inputs, and this type of sharing of
sparsity pattern is stronger among tightly correlated outputs in a cluster with a smaller height in
the tree.

In order to achieve this type of structured sparsity at multiple levels of the hierarchy among 
the outputs, we propose a new regularized regression method called tree-guided group lasso that 
defines groups of variables based on a tree which is assumed to be available as prior knowledge. 
The groups are defined at multiple granularity along the tree to encourage a joint covariate selection 
within each cluster of outputs. We describe a weighting scheme that weights each group such that 
clusters of strongly correlated variables are more encouraged to share common inputs than clusters 
with weaker correlation. Compared to an arbitrary assignment of values for the group weights 
which can lead to an inconsistent estimate [5], the weights are systematically defined in terms of 
the heights of the internal nodes in the tree, and each output variable is penalized in a balanced 
manner even if the groups overlap. 

Our work is primarily motivated by the genetic association mapping problem, where the goal is
to identify a small number of SNPs (inputs) out of millions of SNPs that influence phenotypes (outputs)
such as gene expression measurements for thousands of genes. Many previous studies have
found that multiple genes often participate in the same biological pathways, and are co-expressed
as a module. Furthermore, evidence has been found that the genes within a module often share
a common genetic basis that causes the variations in their expression levels [2]. However,
most of the previous approaches were based on a single-phenotype analysis that treats the multiple
phenotypes as independent of each other, and there has been a lack of statistical tools that can
take advantage of this relatedness among multiple genes to identify SNPs that influence the module
jointly. In this paper, we apply the hierarchical agglomerative clustering algorithm, a popular
method for visualizing the clustering structure among genes, to phenotype data, and use the
clustering tree to construct a tree regularization in our regression method. Although the clustering
tree from hierarchical agglomerative clustering has been previously used as a structural
representation of genes in a regression framework, that work computed averages over the members of the
cluster for each internal node in the tree, and used these averages as inputs, leading to a potential
loss of information [3]. In our method, we use the original data with the clustering tree as a guide
towards a structured sparsity. In our experiments, we demonstrate that our proposed method can
be successfully applied to select SNPs correlated with multiple genes, using both simulated and
yeast datasets.

We begin our discussion with a brief overview of sparse regression methods and multi-task 
learning in Section 2. We describe our proposed method in Section 3, and the optimization al- 
gorithm in Section 4. We present the experimental results using simulated data and yeast data in 
Section 5, and conclude in Section 6. 

2 Background on Sparse Regression and Multi-task Learning 

Let us assume a sample of $N$ instances, each represented by a $J$-dimensional input vector and a
$K$-dimensional output vector. Let $X$ denote the $N \times J$ input matrix, whose $j$-th column corresponds to
the observations for the $j$-th input, $x_j = (x_j^1, \ldots, x_j^N)^T$. In genetic association mapping, each element
$x_j^i$ of the input matrix takes values from $\{0, 1, 2\}$ according to the number of minor alleles at the
$j$-th locus of the $i$-th individual. Let $Y$ denote the $N \times K$ output matrix, whose $k$-th column is a vector
of observations for the $k$-th output, $y_k = (y_k^1, \ldots, y_k^N)^T$. For each of the $K$ output variables, we
assume a linear model:

$$ y_k = X \beta_k + \epsilon_k, \quad \forall k = 1, \ldots, K, \qquad (1) $$

where $\beta_k = (\beta_k^1, \ldots, \beta_k^J)^T$ is a vector of $J$ regression coefficients for the $k$-th output, and $\epsilon_k$ is a
vector of $N$ independent error terms having mean 0 and a constant variance. We center the $y_k$'s
and $x_j$'s such that $\sum_i y_k^i = 0$ and $\sum_i x_j^i = 0$, and consider the model without an intercept.

When $J$ is large and the number of inputs relevant to the output is small, lasso offers an effective
feature selection method for the model in Equation (1) [10]. Let $B = (\beta_1, \ldots, \beta_K)$ denote the
$J \times K$ matrix of regression coefficients of all $K$ outputs. Then, lasso obtains $\hat{B}^{\text{lasso}}$ by solving the
following optimization problem:

$$ \hat{B}^{\text{lasso}} = \operatorname{argmin} \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \sum_k |\beta_k^j|, \qquad (2) $$

where $\lambda$ is a tuning parameter that controls the amount of sparsity in the solution. Setting $\lambda$ to a
large value leads to a smaller number of non-zero regression coefficients. Clearly, the standard
lasso in Equation (2) offers no mechanism to explicitly couple output variables.

In multi-task learning, an $L_1/L_2$ penalty has been used to take advantage of the relatedness of
the outputs and recover the sparsity pattern shared across the related tasks. In an $L_1/L_2$ penalty, an
$L_2$ norm is applied to the regression coefficients for all outputs for each input, $\beta^j$, separately, and
these $J$ $L_2$ norms are combined through an $L_1$ norm to encourage sparsity across input variables.
The $L_1/L_2$-penalized multi-task regression is defined as the following optimization problem:

$$ \hat{B}^{L_1/L_2} = \operatorname{argmin} \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \|\beta^j\|_2. \qquad (3) $$

The $L_1$ part of the penalty plays the role of selecting inputs relevant to at least one task, and
the $L_2$ part combines information across tasks. Since the $L_2$ penalty does not have the property of
encouraging sparsity, if the $j$-th input is selected as relevant, all of the elements of $\beta^j$ take non-zero
values. Thus, the estimate $\hat{B}^{L_1/L_2}$ is sparse only across inputs but not across outputs.
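
The difference between the two penalties can be seen on a small coefficient matrix. The following is a minimal sketch, not part of the original method description: the function names and the toy matrix `B` are illustrative assumptions, with `B` stored as a list of rows so that row `j` plays the role of $\beta^j$.

```python
import math

def lasso_penalty(B):
    """Sum of |B[j][k]| over all inputs j and outputs k (penalty term of Eq. 2)."""
    return sum(abs(b) for row in B for b in row)

def l1_l2_penalty(B):
    """Sum over inputs j of the L2 norm of row j of B (penalty term of Eq. 3)."""
    return sum(math.sqrt(sum(b * b for b in row)) for row in B)

# Toy J x K coefficient matrix (J = 3 inputs, K = 2 outputs); values are illustrative.
B = [[3.0, 4.0],   # input 0 is relevant to both outputs
     [0.0, 0.0],   # input 1 is irrelevant
     [0.0, 2.0]]   # input 2 is relevant to output 1 only

print(lasso_penalty(B))    # entries are penalized one by one
print(l1_l2_penalty(B))    # each row (input) is penalized as a unit
```

The lasso penalty charges every non-zero entry separately, whereas the $L_1/L_2$ penalty charges each input once through the norm of its whole row, which is why a selected input tends to be non-zero for all outputs.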

3 Tree-Guided Group Lasso for Sparse Multiple-output Regression

The $L_1/L_2$-penalized regression assumes that all of the outputs in the problem share a common
set of relevant input variables. Although this method has been shown to be effective under this
scenario [8], in many real-world applications the correlation pattern in the multiple outputs
has a complex structure, such as in gene expression data with subsets of genes forming
functional modules, and it is not realistic to assume that all of the tasks share the same set of
relevant inputs as in the $L_1/L_2$-regularized regression. A subset of highly related outputs may
share a common set of relevant inputs, whereas weakly related outputs are less likely to be affected
by the same inputs.

We assume that the relationships among the outputs can be represented as a tree $T$ with the set
of vertices $V$ of size $|V|$, as shown in Figure 1(a), where each of the $K$ leaf nodes is associated
with an output variable. The internal nodes of the tree represent groupings of the output variables
located at the leaves of the subtree rooted at the given internal node. Each internal node near the
bottom of the tree shows that the output variables of its subtree are highly correlated, whereas the
internal nodes near the root represent weak correlations among the outputs in their subtrees. This
tree structure may be available as prior knowledge, or can be learned from data using methods
such as hierarchical agglomerative clustering. Furthermore, we assume that each node $v \in V$ is
associated with a weight $w_v$, representing the height of the subtree rooted at $v$.

Given this tree $T$ over the outputs, we generalize the $L_1/L_2$ regularization in Equation (3) to
a tree regularization as follows. We expand the $L_2$ part of the $L_1/L_2$ penalty into a group-lasso
penalty, where the groups are defined based on tree $T$. Each node $v \in V$ of tree $T$ is
associated with a group $G_v$ whose members consist of all of the output variables (or leaf nodes) in
the subtree rooted at node $v$. For example, Figure 1(b) shows the groups associated with each of the
nodes of the tree in Figure 1(a). Given these groups of outputs that arise from tree $T$, tree-guided
group lasso can be written as

$$ \hat{B}^{T} = \operatorname{argmin} \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2, \qquad (4) $$

where $\beta^j_{G_v}$ is a vector of regression coefficients $\{\beta_k^j : k \in G_v\}$. Each group of regression
coefficients $\beta^j_{G_v}$ is weighted with $w_v$ so that a group with a large weight is penalized more.

Assuming that each internal node $v$ of the tree $T$ is associated with two quantities $s_v$ and $g_v$
that satisfy the condition $s_v + g_v = 1$ with $0 \le s_v, g_v \le 1$, we define the $w_v$'s in Equation (4) in
terms of the $s_v$'s and $g_v$'s as we describe below. The $s_v$ represents the weight for selecting the output
variables associated with each of its child nodes separately, whereas the $g_v$ represents the weight
for selecting them jointly. We first consider a simple case with two outputs ($K = 2$) with a tree of
three nodes that consist of two leaf nodes ($v_1$ and $v_2$) and one root node ($v_3$), and then generalize
this to an arbitrary tree. When $K = 2$, the penalty term in Equation (4) can be written as

$$ \sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 = \sum_j \left[ s_3 \left( |\beta_1^j| + |\beta_2^j| \right) + g_3 \sqrt{(\beta_1^j)^2 + (\beta_2^j)^2} \right]. \qquad (5) $$



This is equivalent to an elastic-net penalty [15], where $\beta_1^j$ and $\beta_2^j$ can be selected either jointly or
separately according to the weights $s_3$ and $g_3$. When $s_3 = 0$, the penalty in Equation (5) becomes
equivalent to a ridge-regression penalty, whereas setting $g_3 = 0$ in Equation (5) leads to a lasso
penalty. In general, when tree $T$ has height one with the root node having all of the outputs as
its leaf nodes, the tree-guided group-lasso penalty corresponds to an elastic-net penalty, and the $s_v$
and $g_v$ are weights for the $L_1$ and $L_2$ penalties, respectively. A large value of $g_v$ indicates that the
outputs are highly related, and encourages a joint input selection by heavily weighting the $L_2$ part
of the elastic-net penalty.
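
The group construction and the two-output case above can be sketched in a few lines. This is an illustrative sketch only: the `children` dictionary encoding of the tree, the function names, and the coefficient values are assumptions made for the example, not notation from the paper.

```python
import math

def leaf_groups(children, root):
    """Map every node v to G_v, the sorted list of leaves in the subtree rooted at v."""
    groups = {}
    def visit(v):
        kids = children.get(v, [])
        groups[v] = [v] if not kids else sorted(k for c in kids for k in visit(c))
        return groups[v]
    visit(root)
    return groups

def tree_penalty(beta_j, groups, w):
    """Inner sum of Eq. (4) for one input j: sum_v w[v] * ||beta_j restricted to G_v||_2.
    beta_j maps each leaf (output) to its coefficient for input j."""
    return sum(w[v] * math.sqrt(sum(beta_j[k] ** 2 for k in G))
               for v, G in groups.items())

# Two-output tree from the text: leaves 1 and 2 under root 3.
children = {3: [1, 2]}
groups = leaf_groups(children, root=3)
beta_j = {1: 3.0, 2: -4.0}          # illustrative coefficients

def k2_weights(s3):
    # Weights implied by Eq. (5): s_3 on each leaf and g_3 = 1 - s_3 on the root.
    return {1: s3, 2: s3, 3: 1.0 - s3}

print(tree_penalty(beta_j, groups, k2_weights(1.0)))  # g_3 = 0: sum of absolute values
print(tree_penalty(beta_j, groups, k2_weights(0.0)))  # s_3 = 0: L2 norm of (3, -4)
print(tree_penalty(beta_j, groups, k2_weights(0.5)))  # equal mix of the two
```

Moving `s3` between 0 and 1 interpolates between penalizing the two coefficients jointly and penalizing them separately, which is the trade-off that $s_3$ and $g_3$ control in Equation (5).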

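When tree $T$ has a height larger than one, we recursively apply an operation similar to that in Equation
(5), starting from the root node and moving towards the leaf nodes, as follows:

$$ \lambda \sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 = \lambda \sum_j W_j(v_{\text{root}}), \qquad (6) $$

where

$$ W_j(v) = \begin{cases} s_v \cdot \sum_{c \in \text{Children}(v)} |W_j(c)| + g_v \cdot \|\beta^j_{G_v}\|_2 & \text{if } v \text{ is an internal node,} \\ \sum_{m \in G_v} |\beta_m^j| & \text{if } v \text{ is a leaf node.} \end{cases} $$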

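It can be shown that the following relationship holds between the $w_v$'s and the $(s_v, g_v)$'s:

$$ w_v = \begin{cases} g_v \prod_{m \in \text{Ancestors}(v)} s_m & \text{if } v \text{ is an internal node,} \\ \prod_{m \in \text{Ancestors}(v)} s_m & \text{if } v \text{ is a leaf node.} \end{cases} $$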
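
The relationship between the weights and the $(s_v, g_v)$ pairs lends itself to a single top-down pass over the tree. The sketch below assumes a hypothetical `children` dictionary for the tree and illustrative values for $s_v$; it uses a three-leaf tree of the same shape as the one in Example 1.

```python
def node_weights(children, root, s):
    """Return w[v] for every node: g_v times the product of s_m over the ancestors m
    of v for an internal node, and that product alone for a leaf (g_v = 1 - s_v)."""
    w = {}
    def visit(v, prod_s_ancestors):
        kids = children.get(v, [])
        if kids:
            w[v] = (1.0 - s[v]) * prod_s_ancestors
            for c in kids:
                visit(c, prod_s_ancestors * s[v])
        else:
            w[v] = prod_s_ancestors
    visit(root, 1.0)          # the root has no ancestors, so the empty product is 1
    return w

# Leaves 1, 2, 3; internal node 4 groups {1, 2}; root 5 groups {4, 3}.
children = {5: [4, 3], 4: [1, 2]}
s = {4: 0.25, 5: 0.5}         # illustrative values only
w = node_weights(children, root=5, s=s)
print(w)
```

Because each child receives the running product of its ancestors' $s_m$ values, the weights of all nodes are obtained in time linear in the number of nodes.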

The above weighting scheme extends the elastic-net penalty hierarchically, where the $L_2$ norm of
the standard elastic-net penalty corresponds to the group-lasso-like $L_2$ norm in tree-guided group
lasso. Thus, at each internal node $v$, a large value (small penalization) of $s_v$ encourages a separate
selection of covariates for the outputs associated with the given node $v$, whereas a large value of
$g_v$ encourages a joint covariate selection across the outputs. If $s_v = 1$ and $g_v = 0$ for all $v \in V$,
then only separate selections are performed, and the tree-guided group-lasso penalty reduces to the
lasso penalty. On the other hand, if $s_v = 0$ and $g_v = 1$ for all $v \in V$, the penalty reduces to the
$L_1/L_2$ penalty in Equation (3) that performs only a joint covariate selection for all outputs. The
unit contour surfaces of various penalties for $\beta_1^j$, $\beta_2^j$, and $\beta_3^j$ with groups as defined in Figure 1 are
shown in Figure 2.

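Example 1. Given the tree $T$ in Figure 1, for the $j$-th input the penalty of the tree-guided group
lasso in Equation (6) can be written as follows:

$$ \begin{aligned}
W_j(v_1) &= |\beta_1^j|, \quad W_j(v_2) = |\beta_2^j|, \quad W_j(v_3) = |\beta_3^j|, \\
W_j(v_4) &= g_{v_4} \|\beta^j_{G_{v_4}}\|_2 + s_{v_4} \left( |W_j(v_1)| + |W_j(v_2)| \right) = g_{v_4} \|\beta^j_{G_{v_4}}\|_2 + s_{v_4} \left( |\beta_1^j| + |\beta_2^j| \right), \\
W_j(v_{\text{root}}) &= W_j(v_5) = g_{v_5} \|\beta^j_{G_{v_5}}\|_2 + s_{v_5} \left( |W_j(v_4)| + |W_j(v_3)| \right) \\
&= g_{v_5} \|\beta^j_{G_{v_5}}\|_2 + s_{v_5} g_{v_4} \|\beta^j_{G_{v_4}}\|_2 + s_{v_5} s_{v_4} \left( |\beta_1^j| + |\beta_2^j| \right) + s_{v_5} |\beta_3^j|.
\end{aligned} $$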
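
The expansion in Example 1 can be checked numerically by evaluating the recursion of Equation (6) and the expanded last line on the same numbers. In this sketch the tree encoding, the values of $s_v$, and the coefficients are illustrative assumptions chosen so that the norms are easy to read off.

```python
import math

def W(v, beta_j, children, s, groups):
    """Recursive W_j(v) from Eq. (6); beta_j maps each leaf to its coefficient."""
    kids = children.get(v, [])
    if not kids:
        return abs(beta_j[v])
    l2 = math.sqrt(sum(beta_j[k] ** 2 for k in groups[v]))
    separate = sum(abs(W(c, beta_j, children, s, groups)) for c in kids)
    return s[v] * separate + (1.0 - s[v]) * l2      # g_v = 1 - s_v

children = {5: [4, 3], 4: [1, 2]}        # tree of Example 1: v4 = {v1, v2}, root v5
groups = {4: [1, 2], 5: [1, 2, 3]}       # G_v for the two internal nodes
s = {4: 0.25, 5: 0.5}                    # illustrative values
beta_j = {1: 3.0, 2: 4.0, 3: 12.0}       # illustrative coefficients

recursive = W(5, beta_j, children, s, groups)

# Expanded form from the last line of Example 1.
g4, g5 = 1.0 - s[4], 1.0 - s[5]
norm4 = math.sqrt(3.0 ** 2 + 4.0 ** 2)                 # norm over G_v4
norm5 = math.sqrt(3.0 ** 2 + 4.0 ** 2 + 12.0 ** 2)     # norm over G_v5
expanded = g5 * norm5 + s[5] * g4 * norm4 + s[5] * s[4] * (3.0 + 4.0) + s[5] * 12.0

print(recursive, expanded)
```

Both expressions evaluate to the same number, and the coefficients multiplying each norm in `expanded` are exactly the weights $w_v$ given by the relationship stated before the example.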

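Proposition 1. For the $k$-th output, the sum of the weights $w_v$ over all nodes $v \in V$ in $T$
whose group $G_v$ contains the $k$-th output as a member equals one. In other words, the following
holds:

$$ \sum_{v : k \in G_v} w_v = \prod_{m \in \text{Ancestors}(v_{\text{leaf}})} s_m + \sum_{l \in \text{Ancestors}(v_{\text{leaf}})} g_l \prod_{m \in \text{Ancestors}(l)} s_m = 1, $$

where $v_{\text{leaf}}$ denotes the leaf node of the $k$-th output.

Proof. We assume an ordering of the nodes $\{v : k \in G_v\}$ from the leaf $v_k$ to the root $v_{\text{root}}$, and
denote the ancestors of the leaf in this order as $v_1, \ldots, v_M$, writing $s_m$ and $g_m$ for $s_{v_m}$ and $g_{v_m}$.
Since we have $s_v + g_v = 1$ for all $v \in V$, we have

$$ \begin{aligned}
\sum_{v : k \in G_v} w_v &= \prod_{m=1}^{M} s_m + \sum_{l=1}^{M} g_l \prod_{m=l+1}^{M} s_m = s_1 \prod_{m=2}^{M} s_m + g_1 \prod_{m=2}^{M} s_m + \sum_{l=2}^{M} g_l \prod_{m=l+1}^{M} s_m \\
&= (s_1 + g_1) \cdot \prod_{m=2}^{M} s_m + \sum_{l=2}^{M} g_l \prod_{m=l+1}^{M} s_m = \prod_{m=2}^{M} s_m + \sum_{l=2}^{M} g_l \prod_{m=l+1}^{M} s_m = \cdots = 1.
\end{aligned} $$

□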
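
As a concrete instance of Proposition 1, consider the three-output tree of Example 1, in which $v_4$ groups outputs 1 and 2 and the root $v_5$ groups all three outputs. With the weights $w_v$ expressed through $s_v$ and $g_v$ as above, output 1 belongs to the groups of $v_1$, $v_4$, and $v_5$, and output 3 belongs to the groups of $v_3$ and $v_5$:

```latex
\begin{aligned}
\text{output 1:}\quad & w_{v_1} + w_{v_4} + w_{v_5}
  = s_{v_4} s_{v_5} + g_{v_4} s_{v_5} + g_{v_5}
  = (s_{v_4} + g_{v_4})\, s_{v_5} + g_{v_5}
  = s_{v_5} + g_{v_5} = 1, \\
\text{output 3:}\quad & w_{v_3} + w_{v_5} = s_{v_5} + g_{v_5} = 1.
\end{aligned}
```

Output 2 follows in the same way as output 1. The telescoping step $(s_{v_4} + g_{v_4}) = 1$ is the same step that the proof applies repeatedly from the leaf towards the root.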

Proposition 1 states that even if each output $k$ belongs to multiple groups associated with internal
nodes $\{v : k \in G_v\}$ and appears multiple times in the overall penalty in Equation (6), the sum
over the weights of all of the groups that contain the given output variable is always one. Thus, the
weighting scheme in Equation (6) guarantees that the regression coefficients for all of the outputs
are penalized equally. In contrast, group lasso with overlapping groups proposed in [5] used
arbitrarily defined weights, which were empirically shown to lead to an inconsistent estimate. Another
main difference between our method and the work in [5] is that we take advantage of groups
that contain other groups along the tree structure, whereas such groups were removed as
redundant in [5].

Figure 2: Unit contour surface for $\{\beta_1^j, \beta_2^j, \beta_3^j\}$ in various penalties, assuming the tree structure of
output variables in Figure 1. (a) Lasso, (b) $L_1/L_2$, (c) tree-guided group lasso with $g_1 = 0.5$ and
$g_2 = 0.5$, (d) $g_1 = 0.7$ and $g_2 = 0.7$, (e) $g_1 = 0.2$ and $g_2 = 0.7$, and (f) $g_1 = 0.7$ and $g_2 = 0.2$.

4 Parameter Estimation 

In order to estimate the parameters in tree-guided group lasso, we use the alternative formulation
of the problem in Equation (4) that was previously introduced for group lasso [1], given as

$$ \hat{B}^{T} = \operatorname{argmin} \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \Big( \sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 \Big)^2. $$

Since the $L_1/L_2$ norm in the above equation is a non-smooth function, it is not trivial to optimize
it directly. Using the fact that the variational formulation of a mixed-norm regularization is equal
to a weighted $L_2$ regularization [9], we rewrite the above problem so that it contains only smooth
functions, as follows:

$$ \hat{B}^{T} = \operatorname{argmin} \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \sum_{v \in V} \frac{w_v^2 \|\beta^j_{G_v}\|_2^2}{d_{j,v}} $$

$$ \text{subject to } \sum_j \sum_{v \in V} d_{j,v} = 1, \quad d_{j,v} \ge 0 \;\; \forall j, v, $$

where we introduced additional variables $d_{j,v}$'s that need to be estimated. We solve the problem in
the above equation by optimizing the $\beta_k$'s and $d_{j,v}$'s alternately over iterations until convergence. In
each iteration, we first fix the values of the $\beta_k$'s and update the $d_{j,v}$'s as follows:

$$ d_{j,v} = \frac{w_v \|\beta^j_{G_v}\|_2}{\sum_{j'} \sum_{v' \in V} w_{v'} \|\beta^{j'}_{G_{v'}}\|_2}. $$

Then, we hold the values of the $d_{j,v}$'s constant, and update the $\beta_k$'s as

$$ \beta_k = (X^T X + \lambda D)^{-1} X^T y_k, $$

where $D$ is a $J \times J$ diagonal matrix with $\sum_{v \in V} w_v^2 / d_{j,v}$ in the $j$-th element along the diagonal.
The regularization parameter $\lambda$ can be selected using cross-validation.
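
The alternating updates can be sketched with NumPy as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function name, the ridge starting point, the fixed iteration count, and the small constant `eps` (added so that groups whose coefficients are all zero do not cause a division by zero) are choices made for the example. The sketch also builds the diagonal of $D$ for output $k$ from only those groups that contain $k$, which is what setting the gradient of the smooth objective with respect to $\beta_k$ to zero gives; that restriction is an assumption of this sketch.

```python
import numpy as np

def tree_lasso_fit(X, Y, groups, w, lam, n_iter=100, eps=1e-8):
    """Alternate the closed-form updates for d_{j,v} and beta_k.

    groups[v] lists the output indices in G_v and w[v] is its weight w_v.
    Every output is assumed to appear in at least one group (its leaf).
    """
    J, K = X.shape[1], Y.shape[1]
    XtX, XtY = X.T @ X, X.T @ Y
    # Ridge solution as a starting point (an arbitrary choice for this sketch).
    B = np.linalg.solve(XtX + lam * np.eye(J), XtY)
    for _ in range(n_iter):
        # norms[j, v] = w_v * ||beta^j_{G_v}||_2
        norms = np.stack([wv * np.linalg.norm(B[:, G], axis=1)
                          for G, wv in zip(groups, w)], axis=1)
        # d_{j,v} update; eps keeps later divisions finite when a group is zero.
        d = norms / (norms.sum() + eps) + eps
        for k in range(K):
            # Diagonal of D for output k, summing only over groups that contain k.
            diag = sum(w[v] ** 2 / d[:, v]
                       for v, G in enumerate(groups) if k in G)
            B[:, k] = np.linalg.solve(XtX + lam * np.diag(diag), XtY[:, k])
    return B
```

A call such as `tree_lasso_fit(X, Y, groups=[[0], [1], [2], [0, 1], [0, 1, 2]], w=[0.125, 0.125, 0.5, 0.375, 0.5], lam=1.0)` would fit the three-output tree of Example 1. Each $\beta_k$ update is a $J \times J$ linear solve, so for very large $J$ a practical implementation would need a cheaper solver than the dense one used here.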

Figure 3: An example of regression coefficients estimated from a simulated dataset. (a) Tree
structure of the output variables, (b) true regression coefficients, (c) lasso, (d) $L_1/L_2$, (e) tree-guided
group lasso. The rows represent outputs, and the columns inputs.

5 Experiments 

We demonstrate the performance of our method on simulated datasets and a yeast dataset of genotypes
and gene expressions, and compare it with that of lasso and the $L_1/L_2$-regularized
regression, which do not assume any structure among the outputs. We evaluate these methods
based on two criteria: test error, and sensitivity/specificity in detecting the true relevant inputs.

5.1 Simulation Study 

We simulate data using the following scenario analogous to genetic association mapping. We
simulate $(X, Y)$ with $K = 60$, $J = 200$, and $N = 150$ for the training set as follows. We first
generate the inputs $X$ by sampling each element in $X$ from a uniform distribution over $\{0, 1, 2\}$
that corresponds to the number of mutated alleles at each genetic locus. Then, we set the values of
$B$ by first selecting non-zero entries and filling these entries with a pre-defined value. We assume
a hierarchical structure of height four over the outputs as shown in Figure 3(a), and select the
non-zero elements of $B$ so that they correspond to the groupings in the sparsity structure given by this
tree. Figure 3(b) shows the true non-zero elements as white pixels with outputs as rows and inputs
as columns. Given $X$ and $B$, we generate $Y$ with noise distributed as $\mathcal{N}(0, 1)$.
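
The data-generating procedure above can be sketched as follows. The dimensions, the uniform sampling over $\{0, 1, 2\}$, the constant signal value, and the unit-variance Gaussian noise follow the description; the specific block pattern used to fill `B` is a hypothetical stand-in, since the actual pattern is only given graphically in Figure 3(b).

```python
import numpy as np

def simulate(N=150, J=200, K=60, signal=0.4, seed=0):
    """Draw X uniformly from {0, 1, 2}, fill a block-structured B, add N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(N, J)).astype(float)   # upper bound 3 is exclusive
    B = np.zeros((J, K))
    # Hypothetical nested sparsity pattern (not the one in Figure 3(b)):
    B[0:5, :] = signal        # inputs shared by all outputs (root cluster)
    B[5:10, 0:30] = signal    # inputs shared by a cluster of 30 outputs
    B[10:15, 0:10] = signal   # inputs shared by a tighter cluster of 10 outputs
    Y = X @ B + rng.normal(size=(N, K))
    return X, Y, B

X, Y, B = simulate()
print(X.shape, Y.shape, int((B != 0).sum()))
```

Changing `signal` to 0.2 or 0.6 gives the other two signal strengths considered in the experiments, and changing `seed` gives independent replicates.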

We fit lasso, the $L_1/L_2$-regularized regression, and our method to the simulated dataset with
the signal strengths of the non-zero elements of $B$ set to 0.4, and show the results in Figures 3(c)-(e),
respectively. Since lasso does not have any mechanism to borrow strength across different tasks,
false positives among the estimated non-zero regression coefficients are distributed randomly across the
matrix $\hat{B}^{\text{lasso}}$ in Figure 3(c). On the other hand, the $L_1/L_2$ regularization method blindly combines
information across the outputs regardless of the sparsity structure, and the $L_2$ penalty over the
outputs does not encourage sparsity. As a result, once an input is selected as relevant for an output,
it gets selected for all of the other outputs, which tends to create vertical stripes of non-zero values
as shown in Figure 3(d). When the true hierarchical structure in Figure 3(a) is available as prior
knowledge, it is visually clear from Figure 3(e) that our method is able to suppress false positives
among the non-zero regression coefficients, and recover the true underlying sparsity structure significantly
better than the other methods.

Figure 4: ROC curves for the recovery of true non-zero regression coefficients. Results are averaged
over 50 simulated datasets. (a) $\beta_k^j = 0.2$, (b) $\beta_k^j = 0.4$, and (c) $\beta_k^j = 0.6$.

Figure 5: Prediction errors of various regression methods using simulated datasets. Results are
averaged over 50 simulated datasets. (a) $\beta_k^j = 0.2$, (b) $\beta_k^j = 0.4$, and (c) $\beta_k^j = 0.6$.

In order to systematically evaluate the performance of the different methods, we generate 50
simulated datasets, and show in Figure 4 receiver operating characteristic (ROC) curves for the
recovery of the true sparsity pattern averaged over these datasets. Figures 4(a)-(c) represent results
for signal strengths in $B$ of 0.2, 0.4, and 0.6, respectively. Our method clearly
outperforms lasso and the $L_1/L_2$ regularization method. Especially when the signal strength is
weak, as in Figure 4(a), the advantage of incorporating the prior knowledge of the tree as sparsity
structure is significant.

We compare the performance of the different methods in terms of prediction error, using an
additional 50 samples as test data, and show the results in Figures 5(a)-(c) for signal strengths of
0.2, 0.4, and 0.6, respectively. We find that our method has a lower prediction error than the
methods that do not incorporate the sparsity pattern across outputs.
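
The ROC evaluation of sparsity recovery can be sketched as below. The text does not say how the curves in Figure 4 were traced, so this sketch assumes one common choice: an entry of the estimated coefficient matrix is called selected when its absolute value exceeds a threshold, and the threshold is swept. The function name and the toy matrices are illustrative.

```python
def roc_points(B_true, B_hat, thresholds):
    """For each threshold t, call entry (j, k) selected when |B_hat[j][k]| > t and
    return a list of (false positive rate, true positive rate) pairs."""
    truth = [b != 0 for row in B_true for b in row]
    score = [abs(b) for row in B_hat for b in row]
    pos = sum(truth)
    neg = len(truth) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for y, sc in zip(truth, score) if y and sc > t)
        fp = sum(1 for y, sc in zip(truth, score) if not y and sc > t)
        points.append((fp / neg, tp / pos))
    return points

# Toy 3 x 2 example: three true non-zero entries, one small false positive.
B_true = [[0.4, 0.4], [0.0, 0.0], [0.0, 0.4]]
B_hat = [[0.35, 0.10], [0.05, 0.0], [0.0, 0.30]]
print(roc_points(B_true, B_hat, thresholds=[0.0, 0.07, 0.2]))
```

Averaging such curves over the 50 simulated datasets would then require interpolating each curve onto a common grid of false positive rates, a step this sketch leaves out.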

We also consider the scenario where the true tree structure in Figure 3(a) is not known a priori.
In this case, we learn a tree by running hierarchical agglomerative clustering on the $K \times K$
correlation matrix of the outputs, and use this tree and the weights $h_v$ associated with each internal
node in our method. The weight $h_v$ of each internal node $v$ returned by the hierarchical
agglomerative clustering indicates the height of the subtree rooted at the node, or how tightly its members
are correlated. After normalizing the weights (denoted as $h'_v$) of all of the internal nodes such that
the root is at height one, we assign $g_v = h'_v$ and $s_v = 1 - h'_v$. Since the tree obtained in this manner
represents a noisy realization of the true underlying tree structure, we discard the nodes for weak
correlation near the root of the tree by thresholding $h'_v$ at $\rho = 0.9$ and $0.7$, and show the prediction
errors in Figure 5 as T0.9 and T0.7. Even when the true tree structure is not available, our method
is able to benefit from taking into account the output sparsity structure, and gives lower prediction
errors.
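
The conversion from clustering heights to $(s_v, g_v)$ pairs, including the thresholding at $\rho$, can be sketched as follows. The function name and the `heights` dictionary are hypothetical; in practice the heights would come from whichever hierarchical clustering routine is used. How the subtrees that remain after discarding nodes are handled is not specified in the text and is left out of the sketch.

```python
def weights_from_heights(heights, root, rho):
    """heights maps each internal node to its clustering height. Returns
    {v: (s_v, g_v)} for internal nodes whose normalized height h'_v is below rho."""
    scale = heights[root]                     # normalize so the root has height one
    kept = {}
    for v, h in heights.items():
        h_norm = h / scale
        if h_norm < rho:                      # discard weakly correlated clusters
            kept[v] = (1.0 - h_norm, h_norm)  # s_v = 1 - h'_v, g_v = h'_v
    return kept

heights = {"a": 0.5, "b": 1.5, "root": 2.0}   # hypothetical clustering heights
print(weights_from_heights(heights, "root", rho=0.9))
print(weights_from_heights(heights, "root", rho=0.7))
```

With $\rho = 0.9$ only the root is discarded in this toy example, while $\rho = 0.7$ also discards node `b`, so a smaller threshold keeps only the tighter clusters.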

5.2 Analysis of Yeast Data 

We analyze the genotype and gene expression data of 114 yeast strains [14] using various sparse
regression methods. We focus on chromosome 3 with 21 SNPs and 3684 genes. Although
it is well established that genes form clusters in terms of expression levels that correspond to
functional modules, the hierarchical structure over correlated genes is not directly available as
prior knowledge, and we learn the tree structure and node weights from the gene expression data
by running the hierarchical agglomerative clustering algorithm as we described in the previous
section. We use only the internal nodes with heights $h'_v < 0.7$ or $0.9$ in our method. The goal of
the analysis is to search for SNPs (inputs) whose variation induces a significant variation in the
gene expression levels (outputs) over different strains. By applying our method that incorporates
information on gene modules at multiple granularities along the hierarchical clustering tree, we
expect to be able to identify SNPs that influence a group of genes that are co-expressed or
co-regulated.

In Figure 6(a), we show the $K \times K$ correlation matrix of the gene expressions after reordering
the rows and columns according to the results of the clustering algorithm. The estimated $B$ is
shown for lasso, the $L_1/L_2$-regularized regression, and our method with $\rho = 0.9$ and $0.7$ in Figures
6(b)-(e), respectively, where the rows represent genes and the columns SNPs. The lasso estimates
in Figure 6(b) are extremely sparse and do not reveal any interesting structure in SNP-gene
relationships. We believe that the association signals are very weak, as is typically the case in a genetic
association study, and that lasso is unable to detect such weak signals since it does not borrow
strength across genes. The estimates from the $L_1/L_2$-regularized regression are not sparse across
genes, and tend to form vertical stripes of non-zero regression coefficients as can be seen in Figure
6(c). Our method in Figures 6(d)-(e) reveals clear groupings in the patterns of associations between
genes and SNPs. Our method performs significantly better in terms of prediction errors, as can be
seen in Figure 7.

Figure 6: Results for the yeast dataset. (a) Correlation matrix of the gene expression data, where
rows and columns are reordered after applying agglomerative hierarchical clustering. Estimated
regression coefficients are shown for (b) lasso, (c) $L_1/L_2$, (d) tree-guided group lasso with $\rho = 0.9$,
and (e) with $\rho = 0.7$. In (b)-(e), the rows represent genes (outputs), and the columns markers
(inputs).

Figure 7: Prediction error for the yeast dataset.

Figure 8: Enrichment of GO category in estimated regression coefficients for the yeast dataset. (a)
Biological process, (b) molecular function, and (c) cellular component.

Given the estimates of $B$ in Figure 6, we look for an enrichment of GO categories among the
genes with non-zero estimated coefficients for each SNP. A group of genes that form a module often
participate in the same pathways, leading to an enrichment of a GO category among the members
of the module. Since we are interested in identifying SNPs influencing gene modules and our
method reflects this joint association through the hierarchical clustering tree, we hypothesize that
our method would reveal a more significant GO enrichment in the estimated non-zero elements in
$B$. In order to search for a GO enrichment in the results for our method, we use all of the genes with
non-zero elements in $B$ for each SNP. On the other hand, the estimates of the $L_1/L_2$-regularized
method are not sparse across genes. Thus, we threshold the absolute values of the estimated $B$ at
0.005, 0.01, 0.03, and 0.05, and search for GO enrichment only among those genes with $|\beta_k^j|$ above the
threshold.

We perform this analysis for each of the three broad GO categories, biological processes,
molecular functions, and cellular components, and plot the number of SNPs with significant GO
enrichments at different p-value cutoffs in Figure 8. Regardless of the thresholds for selecting
significant associations in the $L_1/L_2$ estimates, our method generally finds more significant
enrichment.

6 Conclusions 

In this paper, we considered a feature selection problem in a multiple-output regression setting 
when the groupings of the outputs can be defined hierarchically using a tree. We proposed a tree- 
guided group lasso that finds a sparse estimate of regression coefficients while taking into account 
the joint sparsity structure across outputs given by a tree. We demonstrated the performance of our 
method using simulated and yeast datasets. 